klotz: dimensionality reduction* + pca*


  1. PCA and t-SNE are popular dimensionality reduction techniques used for data visualization. This tutorial compares PCA and t-SNE, highlighting their strengths and weaknesses, and provides guidance on when to use each method.

    This article from Machine Learning Mastery discusses when to use Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction and data visualization. Here's a summary of the key points:

    * **PCA is a linear dimensionality reduction technique.** It aims to find the directions of greatest variance in the data and project the data onto those directions. It's good for preserving global structure but can distort local relationships. It's computationally efficient.
    * **t-SNE is a non-linear dimensionality reduction technique.** It focuses on preserving the local structure of the data, meaning points that are close together in the high-dimensional space will likely be close together in the low-dimensional space. It excels at revealing clusters but can distort global distances and is computationally expensive.
    * **Key Differences:**
        * **Linearity vs. Non-linearity:** PCA is linear; t-SNE is non-linear.
        * **Global vs. Local Structure:** PCA preserves global structure; t-SNE preserves local structure.
        * **Computational Cost:** PCA is faster; t-SNE is slower.
    * **When to use which:**
        * **PCA:** Use when you need to reduce dimensionality for speed or memory efficiency and preserving global structure matters. Good for data preprocessing before machine learning algorithms.
        * **t-SNE:** Use when you want to visualize high-dimensional data and reveal clusters, and you are less concerned about preserving global distances. Excellent for exploratory data analysis.
    * **Important Considerations for t-SNE:**
        * **Perplexity:** A key parameter that controls the balance between local and global aspects of the embedding. Experiment with different values.
        * **Randomness:** t-SNE is a stochastic algorithm, so results can vary between runs. Run it multiple times to check stability.
        * **Interpretation:** Distances in a t-SNE plot should not be interpreted as true distances in the original high-dimensional space.



    In essence, the article advises choosing PCA for preserving overall data structure and speed, and t-SNE for revealing clusters and local relationships, understanding its limitations regarding global distance interpretation.
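    As a rough illustration of the trade-offs summarized above, here is a minimal scikit-learn sketch; the digits dataset and the perplexity value are illustrative choices, not taken from the article:

    ```python
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)  # 1797 samples, 64 features

    # PCA: linear projection onto the directions of greatest variance (fast, global structure)
    X_pca = PCA(n_components=2).fit_transform(X)

    # t-SNE: non-linear embedding that preserves local neighbourhoods (slower, reveals clusters)
    # Perplexity is the key knob; try several values and rerun, since results are stochastic.
    X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    ```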
  2. This tutorial demonstrates how to perform document clustering using LLM embeddings with scikit-learn. It covers generating embeddings with Sentence Transformers, reducing dimensionality with PCA, and applying KMeans clustering to group similar documents.
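    A minimal sketch of that pipeline, assuming the sentence-transformers package; the model name, document list, and parameter values are examples, and the tutorial's exact choices may differ:

    ```python
    from sentence_transformers import SentenceTransformer
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    docs = [
        "PCA projects data onto directions of maximum variance.",
        "t-SNE preserves local neighbourhood structure.",
        "KMeans groups points around k centroids.",
        "Gradient boosting builds an ensemble of shallow trees.",
    ]

    # 1. Generate embeddings (model name is an example)
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

    # 2. Reduce dimensionality with PCA before clustering
    emb_reduced = PCA(n_components=2).fit_transform(emb)

    # 3. Apply KMeans to group similar documents
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb_reduced)
    print(labels)
    ```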
  3. This article details seven advanced feature engineering techniques using LLM embeddings to improve machine learning model performance. It covers techniques like dimensionality reduction, semantic similarity, clustering, and more.

    The article explores how to leverage LLM embeddings for advanced feature engineering in machine learning, going beyond simple similarity searches. It details seven techniques:

    1. **Embedding Arithmetic:** Performing mathematical operations (addition, subtraction) on embeddings to represent concepts like "positive sentiment - negative sentiment = overall sentiment".
    2. **Embedding Clustering:** Using clustering algorithms (like k-means) on embeddings to create categorical features representing groups of similar text.
    3. **Embedding Dimensionality Reduction:** Reducing the dimensionality of embeddings using techniques like PCA or UMAP to create more compact features while preserving important information.
    4. **Embedding as Input to Tree-Based Models:** Directly using embedding vectors as features in tree-based models like Random Forests or Gradient Boosting. The article highlights the importance of careful handling of high-dimensional data.
    5. **Embedding-Weighted Averaging:** Calculating weighted averages of embeddings based on relevance scores (e.g., TF-IDF) to create a single, representative embedding for a document.
    6. **Embedding Difference:** Calculating the difference between embeddings to capture changes or relationships between texts (e.g., before/after edits, question/answer pairs).
    7. **Embedding Concatenation:** Combining multiple embeddings (e.g., title and body of a document) to create a richer feature representation.
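    A compact numpy/scikit-learn sketch of three of these techniques; random vectors stand in for real LLM embeddings, and the dimensions are illustrative:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    title_emb = rng.normal(size=(100, 384))  # stand-in: embeddings of 100 titles
    body_emb = rng.normal(size=(100, 384))   # stand-in: embeddings of 100 bodies

    # Technique 3: dimensionality reduction -> compact features for downstream models
    body_compact = PCA(n_components=20).fit_transform(body_emb)

    # Technique 6: embedding difference, capturing the relationship between two texts
    title_body_diff = title_emb - body_emb

    # Technique 7: concatenation of multiple embeddings into one richer feature vector
    features = np.hstack([title_emb, body_emb])
    ```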
  4. PCA (principal component analysis) can be effectively used for outlier detection by transforming data into a space where outliers are more easily identifiable due to the reduction in dimensionality and reshaping of data patterns.
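    One common way to do this is to score points by PCA reconstruction error; the sketch below assumes that approach, and the article may use a different criterion:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    X[:5] += 8.0  # inject a few obvious outliers

    # Project onto a few components, reconstruct, and flag large reconstruction errors
    pca = PCA(n_components=3).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))
    error = np.square(X - X_rec).sum(axis=1)

    outlier_idx = np.argsort(error)[-5:]  # indices with the largest error
    ```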
  5. Exploratory data analysis (EDA) is a powerful technique to understand the structure of word embeddings, the basis of large language models. In this article, we'll apply EDA to GloVe word embeddings and find some interesting insights.
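    A minimal sketch of that kind of EDA, assuming a local copy of the GloVe vectors (the file name shown is from the common glove.6B download; the article's exact steps may differ):

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    # Load GloVe vectors from the standard "word v1 v2 ..." text format
    words, vecs = [], []
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=float))
    X = np.vstack(vecs)

    # Simple EDA: vector norms plus a 2-D PCA projection for plotting
    print("mean vector norm:", np.linalg.norm(X, axis=1).mean())
    X_2d = PCA(n_components=2).fit_transform(X)
    ```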
  6. This article explains the PCA algorithm and its implementation in Python. It covers key concepts such as Dimensionality Reduction, eigenvectors, and eigenvalues. The tutorial aims to provide a solid understanding of the algorithm's inner workings and its application for dealing with high-dimensional data and the curse of dimensionality.
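    As a companion to that explanation, a self-contained sketch of PCA via the eigendecomposition of the covariance matrix (not the article's exact code):

    ```python
    import numpy as np

    def pca_from_scratch(X, n_components):
        """Toy PCA: center, eigendecompose the covariance matrix, project."""
        X_centered = X - X.mean(axis=0)
        cov = np.cov(X_centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: covariance matrix is symmetric
        order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue (variance)
        components = eigvecs[:, order[:n_components]]
        return X_centered @ components            # coordinates in the reduced space

    X = np.random.default_rng(0).normal(size=(200, 5))
    X_reduced = pca_from_scratch(X, n_components=2)
    ```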
